Abstract


Semi-Markov processes have proved to be an effective and con- 
venient tool for constructing models of systems that achieve reliability 
by redundancy and reconfiguration. These models are able to depict 
complex system architectures and to capture the dynamics of fault 
arrival and system recovery. A disadvantage of this approach is that the 
models can be extremely large, which poses both a model construction
and a computational problem. Techniques are needed to reduce the 
model size. Because these systems are used in critical applications 
where failure can be expensive, there must be an analytically derived
bound for the error produced by the model reduction technique. This 
report presents a model reduction technique called trimming that can 
be applied to a popular class of systems. Automatic model generation 
programs have been written to help the reliability analyst produce models 
of complex systems. This method (trimming) is easy to implement and 
its error bound easy to compute. Hence, the method lends itself to
inclusion in an automatic model generator. 


Introduction 

Reliable digital control systems are being de- 
signed using redundancy and reconfiguration. The 
reliability requirement for these systems can be ex- 
tremely high. An example is the proposed require- 
ment that the flight control system for a commercial 
aircraft have less than one chance in a billion of fail- 
ure during a 10-hour flight. Such requirements are 
beyond what can be established by natural life test- 
ing. An alternative method is to estimate the prob- 
ability of system failure with a semi-Markov model 
that captures the elements of system architecture, 
component failure, and system recovery from failed 
components. The system architecture can be de- 
scribed by considering the components and how they 
interact. The component failure rate is obtained from 
field data. The description of system recovery from 
failed components is determined from fault-injection 
experiments in the laboratory. These three features 
(system architecture, component failure, and system 
recovery) can be studied separately and then com- 
bined to form the reliability model for the system. 
Semi-Markov processes with their states representing 
the states of the system and their transitions between 
states representing fault occurrences and system re- 
coveries have proved to be an effective and convenient 
reliability estimation tool. 

A major obstacle is that a reconfigurable sys- 
tem of moderate size and complexity can generate 
an enormous semi-Markov model, producing both 
a model construction problem and a computational 
problem. The model construction problem has be- 
come severe enough that computer programs have 
been written to automatically generate reliability
models. These automatic model generators have intensified
the computational problem, since it is now
possible to produce models of extremely complex 
systems. 

Sound and effective procedures are needed for 
model reduction. Since these models describe sys- 
tems that need to be highly reliable, an acceptable 
model reduction method must have an analytically 
derived error bound. Since the model reduction 
method presented in this report is easy to implement 
and its error bound easy to compute, it lends itself to 
inclusion in an automatic model generator. In fact, 
it is currently being developed as a feature of the 
automatic model generator called ASSIST (ref. 1). 

We call the procedure model reduction by trim- 
ming, or just trimming (ref. 2). In the next section, 
we illustrate trimming by means of a concrete exam- 
ple. Then in subsequent sections trimming is pre- 
cisely defined, the theorem for the error bound on 
trimming is precisely stated, and the trimming bound 
is derived. Finally we show that not all models can 
be trimmed and still yield an accurate estimate of 
reliability. It is essential to determine the error pro- 
duced by trimming. 

Illustrative Example 

This section uses a simple example to illustrate 
the basic ideas of model reduction by trimming. This 
example is not completely realistic in engineering 
terms, but it covers the ideas in a concrete manner. 
Suppose a system consists of four central processor 
units, four memories, and four buses. In the initial 
configuration, three components of each type are ac- 
tive while the fourth is a cold spare (with zero failure 
rate). If a component becomes faulty, it is replaced
by a spare. If the number of good processors, good 
memories, or good buses falls below three, then the 
entire system goes into a simplex configuration con- 
sisting of one processor, one memory, and one bus. 
In this simplex configuration, the failure of any of the 
three components causes system failure. The initial 
configuration, showing only the active components, is 
displayed in figure 1. In this initial configuration, and 
in the subsequent configurations as a triad, each pro- 
cessor (CPU) is assigned one bus to use for sending 
data to all three memories. Each memory (MEM) re- 
ceives data from all three processors. Similarly, each 
memory is assigned one bus to use for sending data 
to all three processors. Each processor receives data 
from all three memories. 
Figure 1. Initial configuration of processors, memories, and buses.

The system begins an operation cycle with the 
active processors requesting data from the memories. 
Each memory (on its assigned bus) sends data to all 
three processors. Each processor votes on the re- 
ceived data (i.e., the data from the three memories 
are compared to detect a fault should the data dis- 
agree) and performs its calculations. After comput- 
ing, each processor (on its assigned bus) sends data 
to all three memories. An operation cycle ends when 
each memory votes and stores the data. 

Some critical coupling exists between the proces- 
sors and buses in the sense that the system can have 
a coincident-fault failure when one fault is in a pro- 
cessor and the other fault is in a bus. For example, 
suppose the processors are sending data to the memories
with processor i using bus i for i = A, B, C. If
processor A and bus B are faulty, then the memory
voters can be overwhelmed by incorrect data. However,
if processor A and bus A are faulty, then the
memories will vote correctly. There is similar critical
coupling between the memories and the buses. There 
is no critical coupling between the processors and the 
memories. 

Part of the reliability model for this system is
shown in figure 2. The failure rates for processors,
memories, and buses are $\lambda_P$, $\lambda_M$, and $\lambda_B$, respectively.
For convenience, the system recoveries are assumed
to be constant-rate transitions with $F_P$, $F_M$,
and $F_B$ being the system recovery rates for processors,
memories, and buses. The states are denoted by
A for a fault-free state, R for a single-fault recovery-mode
state, V for a multiple-fault recovery-mode
state, and X and Y for system failure states.

To illustrate model reduction by trimming, consider
state $R_P$ of figure 2, where one of the active
processors has become faulty. The recovery transition
$F_P$ removes this faulty processor and replaces
it with the spare. The transition $2\lambda_P + 2\lambda_B$ represents
the failure of another processor or of a bus
that is critically coupled to the failed processor. The
state X represents system failure because of coincident
faults. The transitions $2\lambda_M$, $\lambda_M$, and $\lambda_B$ represent
fault arrival in components that are not critically
coupled to the faulty processor. It seems reasonable
to think that these last three transitions and their
subsequent states can be ignored with negligible loss
of accuracy, because even after these transitions there
must be another component failure before there is
system failure. We call states such as $R_P$ recovery-mode
states. A recovery-mode state is a state with
a recovery transition out of it. Model reduction by
trimming eliminates, from recovery-mode states, all
component failure transitions that do not cause
immediate system failure.

Two models, complete and trimmed, were con- 
structed for this system using the ASSIST reliability 
model generator (ref. 3). The complete model con- 
tains 227 states and took 7878 cpu (central process- 
ing unit) seconds to compute on a Digital Equipment 
Corp. VAX 11/750 computer. The trimmed model 
contains 83 states and took 258 cpu seconds to com- 
pute on the same computer. Different methods of 
constructing a reliability model can produce equiv- 
alent models with different numbers of states, but 
the relative difference between the complete and the 
trimmed model is thought to remain the same. 

Figure 2. Reliability model for the illustrative example in figure 1.

For the parameter values

$\lambda_P = 10^{-4}$/hour          $F_P = 10^{4}$/hour
$\lambda_M = 5 \times 10^{-4}$/hour    $F_M = 10^{3}$/hour
$\lambda_B = 10^{-5}$/hour          $F_B = 10^{3}$/hour

and an operating time of T = 1 hour, the complete
model returns for the probability of system failure

P(Failure of complete model) = $1.80803726 \times 10^{-9}$

while the trimmed model returns

P(Failure of trimmed model) = $1.80803714 \times 10^{-9}$

The absolute error produced by trimming is $1.2 \times 10^{-16}$.
The relative error (absolute error/true value)
is $6.6 \times 10^{-8}$, which is about 1 part in 10 million.
This example suggests that trimming significantly 
reduces model size and computational effort while 
producing an insignificant amount of error. 
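
The arithmetic behind these comparisons can be checked in a few lines. This is only a sketch using the figures quoted in the text; the variable names are illustrative and not part of the report.

```python
# Values quoted in the text for the illustrative example.
P_COMPLETE = 1.80803726e-9   # failure probability, complete model
P_TRIMMED = 1.80803714e-9    # failure probability, trimmed model

# Absolute and relative error introduced by trimming.
abs_error = P_COMPLETE - P_TRIMMED
rel_error = abs_error / P_COMPLETE

# Model size and run time reported for the two models.
state_ratio = 83 / 227       # trimmed states / complete states
cpu_speedup = 7878 / 258     # complete cpu seconds / trimmed cpu seconds

print(f"absolute error : {abs_error:.2e}")
print(f"relative error : {rel_error:.2e}")
print(f"state ratio    : {state_ratio:.2f}")
print(f"cpu speedup    : {cpu_speedup:.1f}x")
```

The relative error here is computed against the complete model, which is exactly the quantity that is unavailable in practice and that the trimming bound is meant to replace.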


The trimming bound gives an upper bound on the 
absolute error produced by trimming. The trimming 
bound divided by the returned value for the trimmed 
model gives an upper bound for the relative error 
produced by trimming. In practice, if this upper 
bound for the relative error is small enough, then the 
trimmed model is acceptable. The amount of relative 
error that is acceptable varies with the application, 
but a relative error of 10 percent or less is usually 
considered acceptable. 

This practice (of considering only the trimmed 
model and the trimming error bound) can be il- 
lustrated for this example. As will be shown sub- 
sequently, the upper bound for trimming error is 

$$\mathrm{TRMBND} = \theta\mu\left(e^{\theta T} - \theta T - 1\right)$$

where

$\theta$   maximum sum of the rates for the failure transitions leaving any state
$\mu$   largest average holding time for all recovery-mode states
$T$   operating time

For this example, 

$\theta = 3\lambda_P + 3\lambda_M + 3\lambda_B$
$\quad = (3 \times 10^{-4} + 15 \times 10^{-4} + 0.3 \times 10^{-4})$/hour
$\quad = 18.3 \times 10^{-4}$/hour
$\mu = 1/F_M = 10^{-3}$ hour

since $F_M$ is the slowest recovery rate, and T =
1 hour. The computed bound on the trimming error
is

$\mathrm{TRMBND} = 3.066 \times 10^{-12}$

This value of $3.066 \times 10^{-12}$ is a bound on the
absolute error produced by trimming. An upper
bound for the relative error introduced by trimming
is

Relative error $\le$ TRMBND / (Trimmed model result)
$\quad = (3.066 \times 10^{-12}) / (1.80803714 \times 10^{-9})$
$\quad \approx 1.7 \times 10^{-3}$

which indicates that this model can be trimmed with
negligible loss of accuracy. Note that this upper bound
of $1.7 \times 10^{-3}$ for the relative error is obtained
without using the complete reliability model. The
decision to use only the trimmed model has been
made on the basis of the results from the trimmed
model and the error bound for trimming.
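
The bound and the relative-error check for this example can be reproduced numerically. The following is a minimal sketch, not the ASSIST implementation: the function name `trimming_bound` is hypothetical, and `math.expm1` is used in place of a direct exponential only to limit cancellation when the product of failure rate and operating time is small.

```python
import math

def trimming_bound(theta, mu, T):
    """TRMBND = theta * mu * (exp(theta*T) - theta*T - 1)."""
    x = theta * T
    # expm1(x) - x equals e^x - x - 1 with less cancellation for small x.
    return theta * mu * (math.expm1(x) - x)

# Parameter values quoted in the text (per hour).
LAMBDA_P, LAMBDA_M, LAMBDA_B = 1e-4, 5e-4, 1e-5
F_M = 1e3            # slowest recovery rate
T = 1.0              # operating time in hours

theta = 3 * LAMBDA_P + 3 * LAMBDA_M + 3 * LAMBDA_B   # max failure rate out of a state
mu = 1.0 / F_M                                       # largest mean holding time

trmbnd = trimming_bound(theta, mu, T)
rel_bound = trmbnd / 1.80803714e-9                   # divide by trimmed-model result

print(f"TRMBND               = {trmbnd:.3e}")
print(f"relative error bound = {rel_bound:.1e}")
```

Only the trimmed-model result enters the calculation, which mirrors the point made above: the complete model never has to be built.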

The results for this example are typical for an 
application of trimming. The number of states in 
the reliability model is reduced by about half. The 
computational effort is reduced by about an order of 
magnitude. The actual error from trimming is in- 
significant. The derived error bound for trimming is 
much larger than the actual error, but the derived er- 
ror bound is still small compared with the computed 
probability of failure for the system. 


Description of Model Trimming and Statement of the Trimming Bound Theorem

A common approach to achieving reliability is to 
have three or four components perform a majority 
vote. When a component becomes faulty and dis- 
agrees with the majority, it is discarded from the 
system and replaced by a spare if a spare is available. 
There are two failure modes: (1) a coincident-fault 
failure when the voter is overwhelmed because a sec- 
ond component becomes faulty before a first faulty 
component can be removed and (2) an exhaustion-of- 
parts failure when the number of good components 
falls below a minimum level. Almost all fault-tolerant 
systems currently being considered are assemblages 
of subsystems, each of which is a majority-voting
system of the type described above.

For this class of systems, a reliability model 
has normal-operating states where all faulty com- 
ponents (if any) have been removed from the sys- 
tem, recovery-mode states where a faulty component 
has not yet been removed from the system, and ab- 
sorbing states where the system has failed because 
of coincident faults or exhaustion of parts. We ex- 
amine the recovery-mode states more closely. There 
are three types of transitions from a recovery-mode 
state: (1) system recovery, (2) failure of another com- 
ponent that causes immediate system failure (either 
coincident fault or exhaustion of parts), or (3) failure 
of another component that does not cause immediate 
system failure. Note that the third type of transition 
is not a transition to a system failure state. Model 
reduction by trimming is accomplished by removing 
all transitions of the third type (and their subsequent 
states) from the model. 
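
The trimming rule just described can be sketched as a small graph operation. The representation below is an assumption made for illustration: transitions are `(source, target, kind)` tuples with `kind` equal to `"recovery"` or `"failure"`, and the function name `trim` and the toy state names are hypothetical. It is not how ASSIST stores or trims a model.

```python
from collections import deque

def trim(transitions, start, failure_states):
    """Remove type-3 transitions, then drop states that became unreachable."""
    # A recovery-mode state is any state with a recovery transition out of it.
    recovery_mode = {src for src, _, kind in transitions if kind == "recovery"}

    # Type 3: a failure transition out of a recovery-mode state whose target
    # is not a system failure state. Everything else is kept.
    kept = [
        (src, dst, kind)
        for src, dst, kind in transitions
        if not (src in recovery_mode and kind == "failure"
                and dst not in failure_states)
    ]

    # Breadth-first search from the start state over the kept transitions.
    successors = {}
    for src, dst, _ in kept:
        successors.setdefault(src, []).append(dst)
    reachable, queue = {start}, deque([start])
    while queue:
        for nxt in successors.get(queue.popleft(), []):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)

    return [t for t in kept if t[0] in reachable]

# Toy model (hypothetical): X and Y are system failure states.
MODEL = [
    ("A0", "R1", "failure"),
    ("R1", "A1", "recovery"),
    ("R1", "X", "failure"),    # immediate system failure: kept
    ("R1", "V1", "failure"),   # type 3: trimmed
    ("V1", "Y", "failure"),    # subsequent state: becomes unreachable
    ("A1", "R2", "failure"),
    ("R2", "A2", "recovery"),
    ("R2", "X", "failure"),
]
trimmed = trim(MODEL, "A0", {"X", "Y"})
for t in trimmed:
    print(t)
```

The reachability pass is what removes the "subsequent states" mentioned above; deleting the type-3 transitions alone would leave them in the model as dead weight.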

Theorem: Suppose that

1. Components fail at a low constant rate. 

2. Fault recovery depends only on the time since 
fault occurrence. 

3. The system is an assemblage of subsystems, 
each subsystem achieving fault tolerance by a 
three-way or four-way majority vote. 

4. All transitions to system failure are compo- 
nent failure transitions. (This assumption 
eliminates pathological cases.) 

For each state i in the reliability model let

$\theta_i$   sum of the component failure rates out of state i
$\theta$   maximum value among the $\theta_i$
$\mu_j$   average holding time in recovery-mode state $R_j$
$\mu$   maximum value among the $\mu_j$
$T$   system operating time

Then an error bound for model reduction by trimming is

$$\mathrm{TRMBND} = \theta\mu\left(e^{\theta T} - \theta T - 1\right)$$

Figure 3. General path in a semi-Markov reliability model.
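
The two model-dependent parameters in the theorem are both maxima over states, which makes them easy to extract from a model description. The sketch below assumes a hypothetical representation (dictionaries keyed by state name) and uses a hand-picked subset of states from the illustrative example; the per-state rates listed for `R_P` are an assumption based on the discussion of figure 2, not a transcription of the figure.

```python
def bound_parameters(failure_rates_out, recovery_holding_times):
    """Return (theta, mu) as defined in the trimming bound theorem.

    failure_rates_out: state -> list of component failure rates leaving it.
    recovery_holding_times: recovery-mode state -> average holding time.
    """
    theta = max(sum(rates) for rates in failure_rates_out.values())
    mu = max(recovery_holding_times.values())
    return theta, mu

# Rates quoted for the illustrative example (per hour).
LAMBDA_P, LAMBDA_M, LAMBDA_B = 1e-4, 5e-4, 1e-5
F_P, F_M, F_B = 1e4, 1e3, 1e3

# Hypothetical subset of states: the fault-free start state and one
# recovery-mode state.
failure_rates_out = {
    "A0": [3 * LAMBDA_P, 3 * LAMBDA_M, 3 * LAMBDA_B],
    "R_P": [2 * LAMBDA_P + 2 * LAMBDA_B, 2 * LAMBDA_M, LAMBDA_M, LAMBDA_B],
}
# With constant-rate recovery, the mean holding time is taken as 1/rate,
# as in the worked example.
recovery_holding_times = {"R_P": 1 / F_P, "R_M": 1 / F_M, "R_B": 1 / F_B}

theta, mu = bound_parameters(failure_rates_out, recovery_holding_times)
print(f"theta = {theta:.3e} per hour, mu = {mu:.1e} hour")
```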

Derivation of the Trimming Bound 

The error bound for model reduction by trim- 
ming is obtained from a theorem that places an up- 
per bound on the probability of traversing a path in 
a semi-Markov reliability model by time T (refs. 4 
and 5). In such a model, the component failures are 
assumed to occur at a low constant rate, and system 
recovery is allowed to be a fast, arbitrary distribution
that depends only on the time elapsed since component
failure. A general path in such a semi-Markov
reliability model is shown in figure 3. The global time 
independence of a semi-Markov process permits the 
rearrangement of states on the path for notational 
and computational convenience (refs. 4 and 5). In 
figure 3, small Greek letters represent slow constant- 
rate failure transitions, while capital roman letters 
represent fast system recovery transitions. The first
line in figure 3 contains the states ($A_k$) with only
slow constant-rate failure transitions ($\lambda_k$ and $\gamma_k$). In
the second line are states ($B_i$) where the successful
(on-path) transitions are the fast recovery transitions
($F_{i,1}$) competing with slow constant-rate fault transitions
($\epsilon_i$) and possibly other fast transitions ($F_{i,k}$).
In the third line are the states ($C_j$) where the successful
(on-path) transitions are the slow fault occurrences
($\alpha_j$) competing against one or more recoveries
($G_{j,k}$) and possibly other fault transitions ($\beta_j$). For
notation, let

$P(F_{i,1})$   probability that transition $F_{i,1}$ is successful
$\mu(C_j)$   average holding time in state $C_j$ considering only recovery transitions

An upper bound for the probability of traversing the
path in figure 3 by time T is

$$\mathrm{UB} = \frac{T^{n}}{n!} \prod_{k=1}^{n} \lambda_k \; \prod_{i} P(F_{i,1}) \; \prod_{j} \alpha_j\,\mu(C_j) \qquad (1)$$

where $n$ is the number of states $A_k$ on the path.

Figure 4. Path diagram for the derivation of the trimming bound.
The general model for using the algebraic upper 
bound in equation (1) to derive the trimming bound 
is shown in figure 4. This general model displays 
all the paths to the possible system failure states 
in a reliability model for the class of system that 
we are considering. Since the model in figure 4 is 
potentially infinite, it includes transient faults and 
their potentially infinite occurrences. 

The system starts in state $A_0$, which is a fault-free
state. In this initial state, some component
failures can take the system immediately to system
failure. These component failures are represented by
the transition $\epsilon_0$ to the system failure state X. Other
components fail with rates $\alpha_1, \ldots, \alpha_i$ and take the
system to recovery-mode states $R_1, \ldots, R_i$. From each
of the R states, the diagram displays the three types
of transitions out of a recovery-mode state discussed
above. The $\epsilon$ transitions into X are the component
failures that cause immediate system failure. The
F transitions from recovery-mode states to fault-free
states represent the possible system recovery
actions. The $\gamma$ transitions represent the component
failures out of a recovery-mode state that do not
cause immediate system failure.

After a $\gamma$ transition, three simplifying assumptions
are made about system behavior:

1. After a $\gamma$ transition, the system is no longer able
to remove failed components from the system.

2. After a $\gamma$ transition, any other component failure
causes immediate system failure.

3. This last transition (causing immediate system
failure) occurs at rate $\theta$, which is the largest
possible rate for a transition to a failure state,
since $\theta$ is the maximum sum of the failure rates
out of any state.


All these assumptions increase the computed prob- 
ability of system failure (compared with the actual 
probability of system failure). 

Returning to the main sequence of component 
failure and system recovery, the F recovery transi- 
tions out of the R states go to the fault-free B states 
where the cycle of component failure and system re- 
covery begins again. 

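An upper bound is obtained for the probability
of being in state $Y_1$ in figure 4 by considering all
the paths to this failure state. One such path is the
three-step transition from $A_0$ to $R_i$ by $\alpha_i$, from $R_i$
to $V_{ij}$ by $\gamma_{ij}$, and from $V_{ij}$ to $Y_1$ by $\theta$. An upper
bound for traversing this path by time T given by
formula (1), the algebraic upper bound, is

$$\mathrm{UB} = (\alpha_i T)(\gamma_{ij}\,\mu)\left(\frac{\theta T}{2}\right) \qquad (2)$$

where a slow failure transition competing with other
failure transitions contributes the first factor, a slow
failure transition competing with recovery transitions
when the holding time in the recovery-mode state is
less than or equal to $\mu$ contributes the second factor,
and a second slow failure transition contributes the
third factor. Summing over the fan of transitions
from $A_0$ and the fan of transitions from the $R_i$'s gives

$$P(Y_1) \le \sum_i \sum_j (\alpha_i T)(\gamma_{ij}\,\mu)\left(\frac{\theta T}{2}\right) \qquad (3)$$

$$\qquad\quad = \frac{\mu\,\theta\,T^2}{2} \sum_i \alpha_i \sum_j \gamma_{ij} \qquad (4)$$

Since the sum of the failure rates out of any state is
less than or equal to $\theta$,

$$P(Y_1) \le \frac{\mu\,\theta^3\,T^2}{2} \qquad (5)$$

A typical path from $A_0$ to $Y_2$ has the five transitions
from $A_0$ to $R_i$ by $\alpha_i$, from $R_i$ to $B_{ij}$ by $F_{ij}$,
from $B_{ij}$ to $S_{ij,k}$ by $\beta_{ij,k}$, from $S_{ij,k}$ to $V_{ij,k,q}$ by
$\gamma_{ij,k,q}$, and from $V_{ij,k,q}$ to $Y_2$ by $\theta$. An upper bound
for traversing this path by time T is given by formula (1) as

$$\mathrm{UB} = (\alpha_i T)\,[P(F_{ij})]\left(\frac{\beta_{ij,k}\,T}{2}\right)(\gamma_{ij,k,q}\,\mu)\left(\frac{\theta T}{3}\right) \qquad (6)$$

Summing over all the fans gives

$$P(Y_2) \le \sum_i \sum_j \sum_k \sum_q (\alpha_i T)\,[P(F_{ij})]\left(\frac{\beta_{ij,k}\,T}{2}\right)(\gamma_{ij,k,q}\,\mu)\left(\frac{\theta T}{3}\right) \qquad (7)$$

$$\qquad\quad = \frac{\mu\,\theta\,T^3}{3!} \sum_i \alpha_i \sum_j P(F_{ij}) \sum_k \beta_{ij,k} \sum_q \gamma_{ij,k,q} \qquad (8)$$

Since the sum of the failure rates is less than or equal
to $\theta$, and the sum of the probabilities for the recovery
transitions $F_{ij}$ is less than or equal to 1,

$$P(Y_2) \le \frac{\mu\,\theta^4\,T^3}{3!} \qquad (9)$$

In general,

$$P(Y_k) \le \frac{\mu\,\theta^{k+2}\,T^{k+1}}{(k+1)!} \qquad (10)$$

Summing all these bounds for the $Y_k$'s gives a trimming
bound of

$$\mathrm{TRMBND} \le \sum_{k=1}^{\infty} P(Y_k) \le \sum_{k=1}^{\infty} \frac{\mu\,\theta^{k+2}\,T^{k+1}}{(k+1)!} = \theta\mu\left(e^{\theta T} - \theta T - 1\right) \qquad (11)$$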
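
The closed form of the trimming bound is the exponential series with its first two terms removed. The step can be written out as follows; this is a sketch that assumes the level-k bound has the form $\mu\,\theta^{k+2}\,T^{k+1}/(k+1)!$, which is the form consistent with the closed form stated in the theorem, and re-indexes the sum with $n = k + 1$.

```latex
\begin{aligned}
\sum_{k=1}^{\infty} \frac{\mu\,\theta^{k+2}\,T^{k+1}}{(k+1)!}
  &= \theta\mu \sum_{n=2}^{\infty} \frac{(\theta T)^{n}}{n!} \\
  &= \theta\mu \left( e^{\theta T} - 1 - \theta T \right)
\end{aligned}
```

When $\theta T$ is small, the series is dominated by its first term, $\theta\mu(\theta T)^2/2$. For the illustrative example this leading term is about $3.06 \times 10^{-12}$, which already accounts for nearly all of the computed bound.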


An Example With a Large Error Bound 

A theorem on the error produced by trimming 
is necessary since trimming does not always have a 
negligible effect. Consider a system consisting of four 
reconfigurable fourplexes. Each fourplex removes 
itself from the system when the fourplex recovers 
from the second fault occurrence in that fourplex. 
The system fails by exhaustion of parts when all 
four fourplexes have removed themselves from the 
system. A system coincident-fault failure occurs
if any fourplex has a coincident-fault failure. In 
addition, all reconfiguration ceases if there are two 
faults present in two different fourplexes. In this case 
the system fails upon the occurrence of a third fault 
anywhere in the system. For a component failure 
rate of $10^{-4}$/hour, a recovery rate of $10^{3}$/hour, and
an operating time of 700 hours, the error bound for
trimming is

$\mathrm{TRMBND} = 1.5 \times 10^{-6}$

The computed probability of system failure using a 
trimmed model is 

P(Failure trimmed model) = $7.05 \times 10^{-7}$

The error bound for trimming is larger than the value 
returned by the trimmed model. Hence, the the- 
ory indicates that this model should not be trimmed. 
The computed probability of failure using the com- 
plete model is 

P(Failure complete model) = $1.16 \times 10^{-6}$

Trimming this model produces a significant error.
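
The accept-or-reject decision for both examples can be expressed as a short check. This is a sketch under stated assumptions: the function name `may_trim` is hypothetical, and the default tolerance of 0.10 is the 10 percent rule of thumb mentioned earlier in the report, which an application may need to tighten.

```python
def may_trim(trmbnd, trimmed_result, tolerance=0.10):
    """Accept trimming when the bound on the relative error is small enough."""
    return trmbnd / trimmed_result <= tolerance

# Values quoted in the text for the four-fourplex example.
TRMBND = 1.5e-6
P_TRIMMED = 7.05e-7
P_COMPLETE = 1.16e-6

rel_bound = TRMBND / P_TRIMMED            # bound on the relative error
actual_error = P_COMPLETE - P_TRIMMED     # only known because the complete
actual_rel = actual_error / P_COMPLETE    # model was also solved here

print("trim fourplex example?    ", may_trim(TRMBND, P_TRIMMED))
print("trim illustrative example?", may_trim(3.066e-12, 1.80803714e-9))
print(f"relative error bound = {rel_bound:.2f}")
print(f"actual relative error = {actual_rel:.2f}")
```

The actual error stays below the bound, as the theorem requires, but the bound exceeds the trimmed result itself, so the trimmed model alone cannot certify its own answer.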

Concluding Remarks 

This report has presented a method of model 
reduction called trimming and has derived an error 
bound for this method of reducing the number of 
states in a semi-Markov reliability model. The error 
bound uses only three parameters from the semi- 
Markov model: the maximum sum of rates for failure 
transitions leaving any state, the maximum average 
holding time for a recovery-mode state, and the 
operating time for the system. The error bound 
can be computed before any model generation takes 
place so that the modeler can decide immediately 
whether or not the model can be trimmed.
Trimming has a precise and simple description, which
makes it easy to include in a program that generates
reliability models. This report has presented the 
simplest version of the error bound for trimming. 
Tighter bounds can be obtained by requesting more 
information about the system being modeled. For 
example, the current bound does not require any 
information about system recovery from multiple 
faults. Conducting the necessary experiments and 
including this information in the derivation of the 
error can produce a tighter bound. The price of the 
tighter bound is the cost of the experiments. 

This method of model reduction is currently be- 
ing developed as a feature of the automatic model 
generator called ASSIST. 

NASA Langley Research Center 
Hampton, VA 23665-5225 
March 29, 1991 